1. Introduction

In this project, we have 3 tasks. In the first task, we will explore relationships among odds and over/under scores. To explore, we will firstly apply PCA analysis, then will do MDS. Finally, we will conclude by comparing PCA and MDS results.

In the second part, we will make the same analysis for home, away and tie scores.

In the last task, we will do some image processing. We will try to compress images by using PCA.

2. Task 1 & 2

2.1 Data Preprocessing

First, I will install the libraries and data I will use.

require(data.table)
## Loading required package: data.table
require(anytime)
## Loading required package: anytime
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
setwd("C:/IE_582_Rep/fall18-bugracnr/HW2")

#save paths
matches_file_path = "HW2_Files/df9b1196-e3cf-4cc7-9159-f236fe738215_matches.rds"
odd_details_file_path = "HW2_Files/df9b1196-e3cf-4cc7-9159-f236fe738215_odd_details.rds"

#load data
matches=readRDS(matches_file_path)
odds=readRDS(odd_details_file_path)

Now, I need to process matches data. I will add over/under & 1x2 results at the end.

matches[,match_time:=anytime(date)]
matches[,Year:=year(match_time)]
matches[,c("match_time","date","leagueId","type"):=NULL]

#Over Under & 1x2 Results
matches[,c("HomeGoals","AwayGoals"):=tstrsplit(score,':')]
matches$HomeGoals=as.numeric(matches$HomeGoals)
matches[,AwayGoals:=as.numeric(AwayGoals)]
matches[,TotalGoals:=HomeGoals+AwayGoals]
matches[,IsOver:=0]
matches[TotalGoals>2,IsOver:=1]
matches[,Is1 := HomeGoals > AwayGoals]
matches[,Is2 := HomeGoals < AwayGoals]
matches[,IsX := HomeGoals == AwayGoals]
matches[,results := Is1*1 + Is2*2 + IsX*3]
matches <- unique(matches)
matches <- matches[complete.cases(matches)]

I need to process odds data as well. I will filter the bookmakers and the bettypes that I will use. I will remove asian handicap, as I won’t include it in my analysis. Also, I will only use 2.5 handicap in over under bets.

bmakers <- c("Betsafe", "12BET", "bet365", "Betclic", "Pinnacle")
bettypes <- unique(odds$betType)
bettypes <- bettypes[c(-2, -6)]

odds_ou <- odds%>% filter(bookmaker %in% bmakers, betType == "ou", totalhandicap == "2.5") %>% mutate() 
odds <- odds %>% filter(bookmaker %in% bmakers, betType %in%bettypes)  %>% mutate()
odds <- rbind(odds,odds_ou)
odds <- as.data.table(odds)
odds <- odds[order(matchId, oddtype,bookmaker,date)]
odds <- unique(odds)

Now, I need to create my feature vector for the analysis. I will transform my data into wide format. I will also create a color dataset for later use.

odds_final=odds[,list(final_odd=odd[.N]),
                              by=list(matchId,oddtype,bookmaker)]
wide_final <- dcast(odds_final,
                            matchId ~ bookmaker + oddtype,
                            value.var="final_odd")

wide_final <- wide_final[complete.cases(wide_final)]
wide_final <- unique(wide_final)
wide_final <- wide_final[matchId %in% matches$matchId]

match_colors <- merge(wide_final[,"matchId"], matches[,c("matchId", "IsOver", "results")], by = "matchId")

2.2 Task 1.a & 2: PCA Analysis

I will firstly apply PCA on my data. Then, draw a biplot to get some insight.

pca <- princomp(wide_final[,2:ncol(wide_final)])
biplot(pca, xlab = "Component 1", ylab = "Component 2")

As there are plenty of features, biplot seems confusing. However, we can observe that odd1 and odd2 features for all bookmakers will be meaningful for explaining variance.

Now, I will explore the proportion of variance explained by the components.

summary(pca)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3      Comp.4
## Standard deviation     6.3347177 3.0362735 0.73170087 0.245105738
## Proportion of Variance 0.7981168 0.1833554 0.01064827 0.001194865
## Cumulative Proportion  0.7981168 0.9814722 0.99212046 0.993315327
##                            Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     0.23531463 0.2099085301 0.1907979296 0.1649209568
## Proportion of Variance 0.00110131 0.0008763389 0.0007240345 0.0005409581
## Cumulative Proportion  0.99441664 0.9952929764 0.9960170109 0.9965579690
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     0.1523393470 0.1443497729 0.1317663531 0.1264249279
## Proportion of Variance 0.0004615684 0.0004144233 0.0003453194 0.0003178904
## Cumulative Proportion  0.9970195374 0.9974339607 0.9977792801 0.9980971704
##                            Comp.13      Comp.14      Comp.15    Comp.16
## Standard deviation     0.115338466 0.1047918740 0.0998116741 0.08794570
## Proportion of Variance 0.000264582 0.0002184073 0.0001981411 0.00015383
## Cumulative Proportion  0.998361752 0.9985801597 0.9987783008 0.99893213
##                             Comp.17      Comp.18      Comp.19      Comp.20
## Standard deviation     0.0853795671 0.0794535979 0.0771229666 0.0728721751
## Proportion of Variance 0.0001449839 0.0001255565 0.0001182985 0.0001056174
## Cumulative Proportion  0.9990771147 0.9992026712 0.9993209697 0.9994265871
##                             Comp.21      Comp.22      Comp.23      Comp.24
## Standard deviation     6.999944e-02 6.559303e-02 6.290298e-02 4.986978e-02
## Proportion of Variance 9.745432e-05 8.557113e-05 7.869631e-05 4.946372e-05
## Cumulative Proportion  9.995240e-01 9.996096e-01 9.996883e-01 9.997378e-01
##                             Comp.25      Comp.26      Comp.27      Comp.28
## Standard deviation     4.834428e-02 4.538078e-02 4.173257e-02 3.666586e-02
## Proportion of Variance 4.648385e-05 4.095962e-05 3.463875e-05 2.673841e-05
## Cumulative Proportion  9.997843e-01 9.998252e-01 9.998599e-01 9.998866e-01
##                             Comp.29      Comp.30      Comp.31      Comp.32
## Standard deviation     3.193757e-02 2.905609e-02 2.501528e-02 2.356777e-02
## Proportion of Variance 2.028689e-05 1.679138e-05 1.244579e-05 1.104711e-05
## Cumulative Proportion  9.999069e-01 9.999237e-01 9.999361e-01 9.999472e-01
##                             Comp.33      Comp.34      Comp.35      Comp.36
## Standard deviation     2.290690e-02 2.242541e-02 2.030282e-02 1.799915e-02
## Proportion of Variance 1.043625e-05 1.000213e-05 8.198318e-06 6.443413e-06
## Cumulative Proportion  9.999576e-01 9.999676e-01 9.999758e-01 9.999822e-01
##                             Comp.37      Comp.38      Comp.39      Comp.40
## Standard deviation     1.692460e-02 1.417567e-02 1.187359e-02 1.073797e-02
## Proportion of Variance 5.697034e-06 3.996680e-06 2.803986e-06 2.293275e-06
## Cumulative Proportion  9.999879e-01 9.999919e-01 9.999947e-01 9.999970e-01
##                             Comp.41      Comp.42
## Standard deviation     9.616229e-03 7.522060e-03
## Proportion of Variance 1.839168e-06 1.125345e-06
## Cumulative Proportion  9.999989e-01 1.000000e+00

As you can see in the above table, component 1 & 2 explain 98% of total variance. Between them, component 1 explains 79.8% of total variance while component 2 is explaining the remaining 18.33%. Now, I will visualize that information in a histogram.

plot(pca)

As we can see in the histogram, almost all of the variance is explained by the first two components. Now, let’s see the relationship between components with the features.

pca$loadings[,1:2]
##                      Comp.1      Comp.2
## 12BET_odd1      0.208553151  0.29362894
## 12BET_odd2     -0.317851947  0.16647243
## 12BET_oddX     -0.036940659  0.16797079
## 12BET_over      0.004822089 -0.03624817
## 12BET_under    -0.005178460  0.03580004
## Betclic_12      0.002843425 -0.01540651
## Betclic_1X      0.058684500  0.07783210
## Betclic_X2     -0.084016491  0.03664916
## Betclic_odd1    0.206446318  0.29281317
## Betclic_odd2   -0.313932904  0.16132035
## Betclic_oddX   -0.031102880  0.15183243
## Betclic_over    0.004875268 -0.03710045
## Betclic_under  -0.005435775  0.03695075
## Betsafe_1       0.150977359  0.24013545
## Betsafe_12      0.002797074 -0.01595053
## Betsafe_1X      0.060072224  0.08142236
## Betsafe_2      -0.243580156  0.15509054
## Betsafe_NO      0.006743957 -0.01897988
## Betsafe_X2     -0.087118368  0.04005760
## Betsafe_YES    -0.006636634  0.01986085
## Betsafe_odd1    0.217663019  0.31589811
## Betsafe_odd2   -0.330777930  0.17629491
## Betsafe_oddX   -0.034922286  0.17631936
## Betsafe_over    0.004653657 -0.03583008
## Betsafe_under  -0.004956519  0.03712387
## Pinnacle_odd1   0.226574993  0.32812521
## Pinnacle_odd2  -0.355167017  0.19488515
## Pinnacle_oddX  -0.037140010  0.18469073
## Pinnacle_over   0.004994760 -0.03825571
## Pinnacle_under -0.005515512  0.03849300
## bet365_1        0.146302379  0.23322588
## bet365_12       0.002682628 -0.01534460
## bet365_1X       0.060016586  0.08203813
## bet365_2       -0.247183088  0.16905376
## bet365_NO       0.006232554 -0.01973451
## bet365_X2      -0.087439518  0.04044144
## bet365_YES     -0.006501010  0.02114563
## bet365_odd1     0.225605643  0.32596217
## bet365_odd2    -0.353181066  0.19557814
## bet365_oddX    -0.038041222  0.18008537
## bet365_over     0.004343057 -0.03494761
## bet365_under   -0.004729752  0.03479902

The table above shows that, the most contributed features for component 1 are the odd1, 1, odd2 and 2 features of all bookmakers. This is not surprising as we observed this fact in the biplot. 2nd component is mostly contributed by odd1, odd2, oddX, 1, and 2 features of all bookmakers.

Now, let’s try to see if there is a relationship between the results of PCA and the match results.

par(mfrow = c(1,2))
plot(pca$scores, col = match_colors$IsOver + 2, xlab = "Component 1", ylab = "Component 2", main = "Over/Under & PCA Results")
legend("bottomright",cex = 0.55, legend = c("Under", "Over"), 
       fill = 2:3)
plot(pca$scores, col = match_colors$results + 1, xlab = "Component 1", ylab = "Component 2", main = "Home/Away/Tie & PCA Results")
legend("bottomright",cex = 0.55, legend = c("Home", "Away", "Tie"), 
       fill = 2:4)

On the left hand side, you can see the relationship between PCA and over under results. As it can be observed, the results are distributed randomly, and there is no significant relationship. On the right hand side, the same plot is colored with Home, Away and Tie results for the Task 2. Away scores are mostly collected on the right hand side of the plot, while home scores are collected in the opposite side of the graph.

2.3 Task 1.b & 1.c

In this part, I will apply MDS analysis with Euclidean and Manhattan distances to the same data.

First, I need to create distance matrices. Then, I will run the analysis, and plot the results.

manhattan_distances <- dist(na.omit(wide_final[,2:ncol(wide_final)]), method = "manhattan")
euclidean_distances <- dist(na.omit(wide_final[,2:ncol(wide_final)]), method = "euclidean")

fit_manhattan <- cmdscale(manhattan_distances)
fit_euclidean <- cmdscale(euclidean_distances)
par(mfrow = c(1,2))
plot(fit_manhattan[,1],fit_manhattan[,2],main='MDS with Manhattan Distances',xlab='', ylab='',col= match_colors$IsOver + 2)
legend("bottomleft",cex = 0.55, legend = c("Under", "Over"), 
       fill = 2:3)
plot(fit_euclidean[,1],fit_euclidean[,2],main='MDS with Euclidean Distances',xlab='', ylab='',col= match_colors$IsOver + 2)
legend("bottomleft",cex = 0.55, legend = c("Under", "Over"), 
       fill = 2:3)

In both of the MDS results, shapes are reverse of PCA results. In Manhattan distance results, the data more sparsely distributed along an opposite V shape, while Euclidean results are more dense. In both of the plots, we observe no significant grouping of the results. Both of them are distributed randomly.

3. Task 3

3.1 Reading Image

Firstly, I will install libraries and data. The picture that I will process will be the famous Renaissance artist Raphael’s School of Athens fresco.

library(jpeg)
library(imager)
## Loading required package: magrittr
## 
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
## 
##     add
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## The following object is masked from 'package:graphics':
## 
##     frame
## The following object is masked from 'package:base':
## 
##     save.image
library(data.table)



path <- "C:/IE_582_Rep/fall18-bugracnr/HW2"

setwd(path)

pic <- readJPEG("HW2_Files/HW2_Part3.jpeg")

First, let’s take a look at some information about the image.

class(pic)
## [1] "array"
typeof(pic)
## [1] "double"
str(pic)
##  num [1:512, 1:512, 1:3] 0.137 0.129 0.125 0.125 0.129 ...
dim(pic)
## [1] 512 512   3
range(pic)
## [1] 0 1

The image is an array, whose entries are doubles. The dimensions are 5125123 (512*512 for pixels & 3 for rgb channels). The range of values is distributed between 0 & 1.

3.1 Displaying Image

Now, let’s display the image. I will use rasterImage function for that.

par(mfrow = c(1,1))
x <- 1:512
y <- 1:512
plot(x,y, ann = FALSE, axes = FALSE, col = 0)
rasterImage(pic, 0,0,512,512)

The painting is displayed above. Now, let’s display each channel seperately. To do that, I will first create 3 matrices seperately for each channel, and create color palettes. Then, I will display.

r <- t(apply(pic[,,1],2,rev))
g <- t(apply(pic[,,2],2,rev))
b <- t(apply(pic[,,3],2,rev))

redPal <- colorRampPalette(c("black", "red"))
greenPal <- colorRampPalette(c("black", "green"))
bluePal <- colorRampPalette(c("black", "blue"))

par(mfrow = c(1,3))
image(r, col = redPal(256), ann = TRUE, axes = FALSE, main = "R Channel" , xlab = "",ylab = "")
image(g, col = greenPal(256), ann = TRUE, axes = FALSE, main = "G Channel" , xlab = "",ylab = "")
image(b, col = bluePal(256), ann = TRUE, axes = FALSE, main = "B Channel", xlab = "",ylab = "" )

3.3 Adding Noise

Now, I will add noise to the picture. First, I will add noise each of the channels. Then, I will normalize them, and at last I will display the noisy image side-by-side with the original image.

set.seed(1)
noise_1 <-  matrix(runif(512*512, min = 0, max = 0.1),512)
set.seed(2)
noise_2 <-  matrix(runif(512*512, min = 0, max = 0.1),512)
set.seed(3)
noise_3 <-  matrix(runif(512*512, min = 0, max = 0.1),512)

noisy_pic <- pic

noisy_pic[,,1] <- pic[,,1] + noise_1
noisy_pic[,,2] <- pic[,,2] + noise_2
noisy_pic[,,3] <- pic[,,3] + noise_3

noisy_pic <- noisy_pic/max(noisy_pic)


noisy_r <- t(apply(noisy_pic[,,1],2,rev))
noisy_g <- t(apply(noisy_pic[,,2],2,rev))
noisy_b <- t(apply(noisy_pic[,,3],2,rev))
par(mfrow = c(1,2))
plot(x,y, axes = FALSE, col = 0, xlab = "",ylab = "", main = "Noisy Image")
rasterImage(noisy_pic, 0,0,512,512)
plot(x,y, axes = FALSE ,xlab = "",ylab = "", col = 0, main = "Original Image")
rasterImage(pic, 0,0,512,512)

Now, let’s display channel’s of the noisy image side-by-side.

par(mfrow = c(1,3))
image(noisy_r, col = redPal(256), ann = TRUE, axes = FALSE, main = "R Channel" , xlab = "",ylab = "")
image(noisy_g, col = greenPal(256), ann = TRUE, axes = FALSE, main = "G Channel" , xlab = "",ylab = "")
image(noisy_b, col = bluePal(256), ann = TRUE, axes = FALSE, main = "B Channel", xlab = "",ylab = "" )

3.4 Processing the Noisy Image

3.4.1 Transformation to Greyscale & PCA

To start processing, I will transform my image to greyscale. I will use a straightforward approach to do that. I will simply take averages of each channel and normalize the result. Then, I will display the greyscale image.

noisy_grey <- noisy_pic[,,1] + noisy_pic[,,2] + noisy_pic[,,3]
noisy_grey <- noisy_grey/max(noisy_grey)  
par(mfrow = c(1,1))
plot(x,y, main = "Greyscale Image", xlab = "", ylab = "", axes = FALSE, col = 0)
rasterImage(noisy_grey,0,0,512,512)

Now, I will continue with the PCA analysis. I will use 3x3 channels in my analysis. There will be 510*510 patches.

To do PCA, I will need a feature vector. The feature vector will be the vector of patches. The 3x3 patches will be transformed into vectors with length 9, for each of 510x510 patches. To do that operation, I will firstly allocate a memory for the vectors for the sake of computational speed. Then, the transformation will be done in a loop. Finally, the resulting list of vectors will be transformed into a data table for further operations.

sample_vector <- rep(NA,9)
imgdata <- rep(list(sample_vector), 510*510)

k <- 1

for (i in 2:511) {
  for (j in 2:511) {
    imgdata[[k]] <- as.vector(noisy_grey[(i-1):(i+1),(j-1):(j+1)])
    k <- k+1
  }
}

imgdata <- as.data.table(matrix(unlist(imgdata), ncol = 9, byrow = TRUE))

Now, I have the data with dimensions (510x510)x9. I can simply apply PCA on that data.

noisy_pca <- princomp(imgdata)
plot(noisy_pca)

summary(noisy_pca)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     0.6022118 0.10525492 0.08429654 0.06455817
## Proportion of Variance 0.9176954 0.02803398 0.01798123 0.01054636
## Cumulative Proportion  0.9176954 0.94572936 0.96371059 0.97425695
##                             Comp.5      Comp.6      Comp.7      Comp.8
## Standard deviation     0.055242978 0.048601123 0.044822342 0.038946357
## Proportion of Variance 0.007722434 0.005977129 0.005083808 0.003838254
## Cumulative Proportion  0.981979384 0.987956513 0.993040321 0.996878576
##                             Comp.9
## Standard deviation     0.035121770
## Proportion of Variance 0.003121424
## Cumulative Proportion  1.000000000

As it can be observed from the above histogram and table, component 1 explains 91.7% of all variance by itself, which is a huge proportion. Second and third components explain 2.8% and 1.8% of variance respectively.

3.4.2 Reconstruction

As the components converts pixel values in the original image into the scores of PCA, it is possible to reconstruct the image from the scores. We expect to see higher variance explaining components will reconstruct the image closer to the unprocessed noisy image.

To reconstruct the images, we need to process scores. Scores are firstly normalized, then, turned into matrices. At last, they are displayed by the rasterImage function.

score_1 <- noisy_pca$scores[,1]
score_2 <- noisy_pca$scores[,2]
score_3 <- noisy_pca$scores[,3]

score_1 <- (score_1 - min(score_1))/(max(score_1)-min(score_1))
score_2 <- (score_2 - min(score_2))/(max(score_2)-min(score_2))
score_3 <- (score_3 - min(score_3))/(max(score_3)-min(score_3))


mat_score_1 <- matrix(score_1, 510, byrow=TRUE)
mat_score_2 <- matrix(score_2, 510, byrow=TRUE)
mat_score_3 <- matrix(score_3, 510, byrow=TRUE)
par(mfrow = c(1,2))
plot(1:512,1:512, ann = TRUE, axes = FALSE, col = 0, main = "Unprocessed Image", xlab = "", ylab = "")
rasterImage(noisy_grey,0,0,512,512)
plot(x,y, ann = TRUE, axes = FALSE, col = 0, main = "Component 1 Reconstruction", xlab = "", ylab = "")
rasterImage(mat_score_1,0,0,510,510)

par(mfrow = c(1,2))
plot(x,y, ann = TRUE, axes = FALSE, col = 0, main = "Component 2 Reconstruction", xlab = "", ylab = "")
rasterImage(mat_score_2,0,0,510,510)
plot(x,y, ann = TRUE, axes = FALSE, col = 0, main = "Component 3 Reconstruction", xlab = "", ylab = "")
rasterImage(mat_score_3,0,0,510,510)

A it can be seen above, the reconstruction of component 1 is very close to the unprocessed image. The others are only shows the silhouettes.

3.4.3 Eigenvector Analysis

As the eigenvectors are in charge of transformation, we can get valuable information from them. We will plot eigenvectors as images to see how they affect the patches.

To do that operation, I will firstly normalize the eigenvectors, then convert them into matrices. At lasti I will display them.

eigen_1 <- noisy_pca$loadings[,1]
eigen_2 <- noisy_pca$loadings[,2]
eigen_3 <- noisy_pca$loadings[,3]

eigen_1 <- (eigen_1 - min(eigen_1))/(max(eigen_1)-min(eigen_1))
eigen_2 <- (eigen_2 - min(eigen_2))/(max(eigen_2)-min(eigen_2))
eigen_3 <- (eigen_3 - min(eigen_3))/(max(eigen_3)-min(eigen_3))

eigenmat_1 <- matrix(eigen_1,3)
eigenmat_2 <- matrix(eigen_2,3)
eigenmat_3 <- matrix(eigen_3,3)

noisy_pca$loadings[,1:3]
##       Comp.1       Comp.2        Comp.3
## V1 0.3310116  0.379955319  3.981413e-01
## V2 0.3340084  0.438978660  1.167867e-02
## V3 0.3307110  0.401698240 -3.742941e-01
## V4 0.3352950 -0.013109732  4.481133e-01
## V5 0.3388466  0.000681731  9.558156e-05
## V6 0.3353159  0.014510517 -4.477142e-01
## V7 0.3303535 -0.402666147  3.750868e-01
## V8 0.3336701 -0.440290509 -1.213068e-02
## V9 0.3306875 -0.381040120 -3.989489e-01
greyPal <- colorRampPalette(c("black","grey"))

par(mfrow=c(1,3))
image(eigenmat_1, col = greyPal(256), ann = TRUE, axes = FALSE, main = "Eigenvector 1", xlab = "", ylab = "")
image(eigenmat_2, col = greyPal(256), ann = TRUE, axes = FALSE, main = "Eigenvector 2", xlab = "", ylab = "")
image(eigenmat_3, col = greyPal(256), ann = TRUE, axes = FALSE, main = "Eigenvector 3", xlab = "", ylab = "")

As it can be observed in the above images and the table, first eigenvector is distributed uniformly. Which means, all of the pixels in patches has similar weights. This results in better reconstructionof the image. The second and third eigenvectors has similar weights in the same rows and columns respectively. This resulted in the gradient view in the displays of eigenvectors, and also the shady display of reconstructions.

4. Conclusion

In this project, I worked on odd data in part 1&2, and image data in part 3. I mainly used PCA, and MDS analysis in the first two parts. There were no significant relationship found between the feature vector and the results. In the image processing part I only used PCA. I was able to successfully load and process the image. I also applied PCA on the image data and was able to successfully reconstruct the image.